historical frame
Thinking Ahead: Foresight Intelligence in MLLMs and World Models
Gong, Zhantao, Fan, Liaoyuan, Guo, Qing, Xu, Xun, Yang, Xulei, Li, Shijie
In this work, we define Foresight Intelligence as the capability to anticipate and interpret future events, an ability essential for applications such as autonomous driving, yet largely overlooked by existing research. To bridge this gap, we introduce FSU-QA, a new Visual Question-Answering (VQA) dataset specifically designed to elicit and evaluate Foresight Intelligence. Using FSU-QA, we conduct the first comprehensive study of state-of-the-art Vision-Language Models (VLMs) under foresight-oriented tasks, revealing that current models still struggle to reason about future situations. Beyond serving as a benchmark, FSU-QA also enables the assessment of world models by measuring the semantic coherence of their generated predictions, quantified through the performance gains VLMs obtain when augmented with such outputs. Our experiments further demonstrate that FSU-QA can effectively enhance foresight reasoning: even small VLMs fine-tuned on FSU-QA surpass much larger, advanced models by a substantial margin. Together, these findings position FSU-QA as a principled foundation for developing next-generation models capable of truly anticipating and understanding future events.
- Asia > Singapore (0.04)
- North America > United States > Tennessee > Davidson County > Nashville (0.04)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- Asia > China > Hong Kong (0.04)
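The world-model assessment described above reduces to a simple accuracy-delta loop: answer each FSU-QA question once from the observed frames alone and once with the world model's predicted frames appended, then compare. A minimal sketch, assuming hypothetical callables vlm_answer and world_model_rollout (neither name comes from the paper):

def foresight_gain(vlm_answer, world_model_rollout, dataset):
    """Score a world model by the VQA accuracy gain it gives a VLM."""
    base_correct = aug_correct = 0
    for sample in dataset:  # each sample: {"frames", "question", "answer"}
        base = vlm_answer(sample["frames"], sample["question"])
        predicted = world_model_rollout(sample["frames"])  # hypothetical rollout
        aug = vlm_answer(sample["frames"] + predicted, sample["question"])
        base_correct += base == sample["answer"]
        aug_correct += aug == sample["answer"]
    n = len(dataset)
    return (aug_correct - base_correct) / n  # positive gain = coherent predictions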
Learning World Models for Interactive Video Generation
Chen, Taiye, Hu, Xun, Ding, Zihan, Jin, Chi
Foundational world models must both be interactive and preserve spatiotemporal coherence to support effective future planning with action choices. However, present models for long video generation have limited inherent world modeling capabilities due to two main challenges: compounding errors and insufficient memory mechanisms. We enhance image-to-video models with interactive capabilities through additional action conditioning and an autoregressive framework, and reveal that compounding error is inherently irreducible in autoregressive video generation, while an insufficient memory mechanism leads to incoherence of world models. We propose video retrieval augmented generation (VRAG) with explicit global state conditioning, which significantly reduces long-term compounding errors and increases the spatiotemporal consistency of world models. In contrast, naive autoregressive generation with extended context windows and retrieval-augmented generation prove less effective for video generation, primarily due to the limited in-context learning capabilities of current video models. Our work illuminates the fundamental challenges in video world models and establishes a comprehensive benchmark for improving video generation models with internal world modeling capabilities.
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Asia > Middle East > Saudi Arabia > Northern Borders Province > Arar (0.04)
- Asia > China > Anhui Province > Hefei (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.88)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.55)
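The VRAG idea above can be sketched as an autoregressive loop in which each step retrieves past frames by similarity of an explicit global state embedding and conditions generation on them. A minimal sketch, assuming hypothetical encode_state and generate callables and a cosine-similarity retriever (all illustration, not the paper's implementation):

import numpy as np

def vrag_step(generate, encode_state, memory, context, action, top_k=4):
    """One autoregressive step with video retrieval augmented generation:
    memory holds (global_state, frame) pairs, and the next frame is
    conditioned on the k most relevant retrieved frames."""
    query = encode_state(context, action)            # explicit global state
    scores = [float(np.dot(query, s) /               # cosine similarity
                    (np.linalg.norm(query) * np.linalg.norm(s) + 1e-8))
              for s, _ in memory]
    retrieved = [memory[i][1] for i in np.argsort(scores)[-top_k:]]
    frame = generate(context, action, retrieved)     # condition on retrieval
    memory.append((query, frame))                    # grow the global memory
    return frame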
Enhancing Vision-Language Models for Autonomous Driving through Task-Specific Prompting and Spatial Reasoning
This technical report presents our solution for the RoboSense Challenge at IROS 2025, which evaluates Vision-Language Models (VLMs) on autonomous driving scene understanding across perception, prediction, planning, and corruption detection tasks. We propose a systematic framework built on four core components. First, a Mixture-of-Prompts router classifies questions and dispatches them to task-specific expert prompts, eliminating interference across diverse question types. Second, task-specific prompts embed explicit coordinate systems, spatial reasoning rules, role-playing, Chain-of-Thought/Tree-of-Thought reasoning, and few-shot examples tailored to each task. Third, a visual assembly module composes multi-view images with object crops, magenta markers, and adaptive historical frames based on question requirements. Fourth, we configure model inference parameters (temperature, top-p, message roles) per task to optimize output quality. Implemented on Qwen2.5-VL-72B, our approach achieves 70.87% average accuracy on Phase-1 (clean data) and 72.85% on Phase-2 (corrupted data), demonstrating that structured prompting and spatial grounding substantially enhance VLM performance on safety-critical autonomous driving tasks. Code and prompts are available at https://github.com/wuaodi/UCAS-CSU-phase2.
- Transportation > Ground > Road (0.94)
- Automobiles & Trucks (0.94)
- Information Technology > Robotics & Automation (0.85)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.92)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.86)
- Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.85)
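The Mixture-of-Prompts router above is, at its core, a classifier that maps each question to one expert prompt so prompts for different task types never interfere. A minimal keyword-based sketch; the task names, routing rules, and prompt texts are illustrative assumptions, not the challenge submission's actual prompts:

EXPERT_PROMPTS = {
    "perception": "You are a driving perception expert. Coordinates use ...",
    "prediction": "You are a motion forecasting expert. Reason step by step ...",
    "planning":   "You are a planning expert. Consider the ego vehicle ...",
    "corruption": "You are a data quality inspector. Check each view for ...",
}

ROUTING_RULES = [
    ("corruption", ("corrupt", "noise", "blur")),
    ("planning",   ("should", "plan", "action")),
    ("prediction", ("will", "future", "next")),
    ("perception", ("what is", "describe", "visible")),
]

def route(question: str) -> str:
    """Dispatch a question to its task-specific expert prompt."""
    q = question.lower()
    for task, keywords in ROUTING_RULES:
        if any(k in q for k in keywords):
            return EXPERT_PROMPTS[task]
    return EXPERT_PROMPTS["perception"]  # fallback expert

print(route("Will the pedestrian cross the road?")[:40])  # prediction expert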
Uirapuru: Timely Video Analytics for High-Resolution Steerable Cameras on Edge Devices
Apostolo, Guilherme H., Bauszat, Pablo, Nigade, Vinod, Bal, Henri E., Wang, Lin
Real-time video analytics on high-resolution cameras has become a popular technology for various intelligent services like traffic control and crowd monitoring. While extensive work has been done on improving analytics accuracy with timing guarantees, virtually all of it targets static-viewpoint cameras. In this paper, we present Uirapuru, a novel framework for real-time, edge-based video analytics on high-resolution steerable cameras. The actuation performed by these cameras brings significant dynamism to the scene, presenting a critical challenge to existing popular approaches such as frame tiling. To address this problem, Uirapuru incorporates a comprehensive understanding of camera actuation into the system design, paired with fast adaptive tiling at a per-frame level. We evaluate Uirapuru on a high-resolution video dataset augmented by pan-tilt-zoom (PTZ) movements typical of steerable cameras, as well as on real-world videos collected from an actual PTZ camera. Our experimental results show that Uirapuru provides up to a 1.45x improvement in accuracy while respecting specified latency budgets, or reaches up to a 4.53x inference speedup with on-par accuracy, compared to state-of-the-art static-camera approaches.
- Asia > China > Hong Kong (0.05)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
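One way to picture per-frame adaptive tiling under camera actuation: when the PTZ camera is moving fast, the whole scene shifts and fine tiling wastes the latency budget, so the tile grid should coarsen with actuation magnitude. A rough sketch under that assumption; the thresholds and policy below are illustrative, not Uirapuru's actual algorithm:

from dataclasses import dataclass

@dataclass
class PTZState:
    pan: float   # degrees per frame
    tilt: float  # degrees per frame
    zoom: float  # relative zoom change per frame

def tile_grid(frame_w: int, frame_h: int, ptz: PTZState):
    """Pick a tile layout for this frame: strong actuation means the scene
    changes globally, so use fewer, larger tiles; a steady camera allows a
    fine grid that spends the latency budget on small objects."""
    motion = abs(ptz.pan) + abs(ptz.tilt) + 10 * abs(ptz.zoom)
    cols = rows = 4 if motion < 0.5 else (2 if motion < 2.0 else 1)
    tw, th = frame_w // cols, frame_h // rows
    return [(x * tw, y * th, tw, th) for y in range(rows) for x in range(cols)]

print(len(tile_grid(3840, 2160, PTZState(pan=0.0, tilt=0.1, zoom=0.0))))  # 16 tiles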
MapNav: A Novel Memory Representation via Annotated Semantic Maps for VLM-based Vision-and-Language Navigation
Zhang, Lingfeng, Hao, Xiaoshuai, Xu, Qinwen, Zhang, Qiang, Zhang, Xinyao, Wang, Pengwei, Zhang, Jing, Wang, Zhongyuan, Zhang, Shanghang, Xu, Renjing
Vision-and-language navigation (VLN) is a key task in Embodied AI, requiring agents to navigate diverse and unseen environments while following natural language instructions. Traditional approaches rely heavily on historical observations as spatio-temporal contexts for decision making, leading to significant storage and computational overhead. In this paper, we introduce MapNav, a novel end-to-end VLN model that leverages an Annotated Semantic Map (ASM) to replace historical frames. Specifically, our approach constructs a top-down semantic map at the start of each episode and updates it at each timestep, allowing for precise object mapping and structured navigation information. We then enhance this map with explicit textual labels for key regions, transforming abstract semantics into clear navigation cues and generating our ASM. The MapNav agent takes the constructed ASM as input and uses the powerful end-to-end capabilities of VLMs to empower VLN. Extensive experiments demonstrate that MapNav achieves state-of-the-art (SOTA) performance in both simulated and real-world environments, validating the effectiveness of our method. Moreover, we will release our ASM generation source code and dataset to ensure reproducibility, contributing valuable resources to the field. We believe that our proposed MapNav can serve as a new memory representation method in VLN, paving the way for future research in this field.
- Research Report > New Finding (0.68)
- Research Report > Promising Solution (0.46)
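An Annotated Semantic Map as described above can be pictured as a top-down class grid updated every timestep, plus explicit text labels attached to key cells so the VLM reads navigation cues rather than raw semantics. A minimal sketch; the grid size, class set, and update API are illustrative assumptions, not the paper's implementation:

import numpy as np

class AnnotatedSemanticMap:
    def __init__(self, size=128, classes=("floor", "wall", "door", "sofa")):
        self.classes = classes
        self.grid = np.zeros((size, size), dtype=np.int8)  # 0 = unobserved
        self.labels = {}  # (row, col) -> text label rendered onto the map

    def update(self, detections):
        """Fuse one timestep of detections: (class_name, row, col) triples
        already projected to top-down coordinates by an upstream module."""
        for name, r, c in detections:
            self.grid[r, c] = self.classes.index(name) + 1
            self.labels[(r, c)] = name  # explicit text cue for the VLM

asm = AnnotatedSemanticMap()
asm.update([("door", 60, 64), ("sofa", 70, 40)])
print(asm.labels)  # {(60, 64): 'door', (70, 40): 'sofa'}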
MALT: Multi-scale Action Learning Transformer for Online Action Detection
Yang, Zhipeng, Wang, Ruoyu, Tan, Yang, Xie, Liping
Online action detection (OAD) aims to identify ongoing actions from streaming video in real time, without access to future frames. Since these actions manifest at varying scales of granularity, ranging from coarse to fine, projecting an entire set of action frames to a single latent encoding may result in a loss of local information, necessitating the acquisition of action features across multiple scales. In this paper, we propose a multi-scale action learning transformer (MALT), which includes a novel recurrent decoder for feature fusion that has fewer parameters and can be trained more efficiently. We further propose a hierarchical encoder with multiple encoding branches to capture multi-scale action features. The output of each preceding branch is incrementally fed into the subsequent branch as part of a cross-attention calculation, so output features transition from coarse to fine as the branches deepen. We also introduce an explicit frame scoring mechanism employing sparse attention, which filters irrelevant frames more efficiently without requiring an additional network. The proposed method achieved state-of-the-art performance on two benchmark datasets, outperforming all existing models used for comparison by 0.2% mAP on THUMOS'14 and 0.1% mcAP on TVSeries.
- Asia > China > Guangdong Province > Shenzhen (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- Asia > China > Jiangsu Province > Nanjing (0.04)
- Information Technology > Artificial Intelligence > Vision (0.71)
- Information Technology > Sensing and Signal Processing > Image Processing (0.68)
- Information Technology > Data Science > Data Mining (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
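The hierarchical encoder above chains branches so that each coarser branch's output enters the next, finer branch through cross-attention. A condensed PyTorch sketch of that coarse-to-fine chaining; the dimensions, pooling scales, and number of branches are assumptions, and the paper's actual encoder blocks are more elaborate:

import torch
import torch.nn as nn

class BranchChain(nn.Module):
    def __init__(self, dim=256, scales=(16, 8, 4), heads=4):
        super().__init__()
        self.scales = scales  # temporal pooling sizes, coarse -> fine
        self.cross = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True) for _ in scales
        )

    def forward(self, frames):  # frames: (B, T, dim)
        prev = None
        for pool, attn in zip(self.scales, self.cross):
            # pool the frame sequence to this branch's temporal granularity
            x = nn.functional.avg_pool1d(frames.transpose(1, 2), pool,
                                         stride=pool).transpose(1, 2)
            if prev is not None:  # coarser branch output feeds the next branch
                x, _ = attn(x, prev, prev)
            prev = x
        return prev  # finest-scale features

feats = BranchChain()(torch.randn(2, 64, 256))
print(feats.shape)  # torch.Size([2, 16, 256])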
SVQNet: Sparse Voxel-Adjacent Query Network for 4D Spatio-Temporal LiDAR Semantic Segmentation
Chen, Xuechao, Xu, Shuangjie, Zou, Xiaoyi, Cao, Tongyi, Yeung, Dit-Yan, Fang, Lu
LiDAR-based semantic perception tasks are critical yet challenging for autonomous driving. Due to the motion of objects and static/dynamic occlusion, temporal information plays an essential role in reinforcing perception by enhancing and completing single-frame knowledge. Previous approaches either directly stack historical frames onto the current frame or build a 4D spatio-temporal neighborhood using KNN, which duplicates computation and hinders real-time performance. Based on our observation that stacking all the historical points would damage performance due to a large amount of redundant and misleading information, we propose the Sparse Voxel-Adjacent Query Network (SVQNet) for 4D LiDAR semantic segmentation. To take full advantage of the historical frames efficiently, we shunt the historical points into two groups with reference to the current points. One is the Voxel-Adjacent Neighborhood, carrying local enhancing knowledge; the other is the Historical Context, completing the global knowledge. We then propose new modules to select and extract instructive features from the two groups. Our SVQNet achieves state-of-the-art performance in LiDAR semantic segmentation on the SemanticKITTI benchmark and the nuScenes dataset.
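The shunting step above partitions historical points by voxel adjacency to the currently occupied voxels: nearby points form the local Voxel-Adjacent Neighborhood, and the rest become the global Historical Context. A minimal sketch; the voxel size and adjacency radius are assumptions for illustration, not the paper's settings:

import numpy as np

def shunt_points(current, history, voxel=0.2, radius=1):
    """Split historical points (N, 3) into (adjacent, context) groups."""
    occupied = {tuple(v) for v in np.floor(current / voxel).astype(int)}
    hist_vox = np.floor(history / voxel).astype(int)
    offsets = [(i, j, k) for i in range(-radius, radius + 1)
                         for j in range(-radius, radius + 1)
                         for k in range(-radius, radius + 1)]
    # a historical point is "adjacent" if any neighboring voxel is occupied now
    near = np.array([any(tuple(v + o) in occupied for o in offsets)
                     for v in hist_vox])
    return history[near], history[~near]

cur = np.random.rand(100, 3)
hist = np.random.rand(500, 3) * 2.0
adjacent, context = shunt_points(cur, hist)
print(adjacent.shape[0] + context.shape[0])  # 500, every point lands in one group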